Will Bilevel Optimizers Benefit from Loops
Bilevel optimization has arisen as a powerful tool for solving a variety of machine learning problems. Two currently popular bilevel optimizers, AID-BiO and ITD-BiO, naturally involve solving one or two sub-problems, and consequently, whether we solve these sub-problems with loops (which take many iterations) or without loops (which take only a few iterations) can significantly affect the overall computational efficiency. Existing studies in the literature cover only some of these implementation choices, and the available complexity bounds are not refined enough to enable a rigorous comparison among different implementations. In this paper, we first establish a unified convergence analysis for both AID-BiO and ITD-BiO that is applicable to all implementation choices of loops. We then specialize our results to characterize the computational complexity of all implementations, which enables an explicit comparison among them. Our results indicate that for AID-BiO, the loop for estimating the optimal point of the inner function is beneficial for overall efficiency, although it incurs higher complexity per update step, and the loop for approximating the outer-level Hessian-inverse-vector product reduces the gradient complexity. For ITD-BiO, the two loops always coexist, and our convergence upper and lower bounds show that such loops are necessary to guarantee a vanishing convergence error, whereas the no-loop scheme suffers from an unavoidable non-vanishing convergence error. Our numerical experiments further corroborate our theoretical results.
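To make the loop structure concrete, here is a minimal AID-BiO sketch on a hypothetical quadratic bilevel problem; the problem instance, step sizes, and loop lengths N and M below are illustrative assumptions rather than settings from the paper, and N = M = 1 recovers the no-loop scheme.

```python
import numpy as np

# Hypothetical quadratic bilevel instance (for illustration only):
#   inner:  g(x, y) = 0.5 * ||A y - x||^2   (strongly convex in y)
#   outer:  f(x, y) = 0.5 * ||y - b||^2
# Each AID-BiO outer step runs (1) an N-step loop approximating y*(x),
# (2) an M-step loop approximating v = [grad2_yy g]^{-1} grad_y f,
# (3) one hypergradient step on x via the implicit-function theorem:
#   grad Phi(x) = grad_x f - grad2_xy g @ v.

rng = np.random.default_rng(0)
d = 5
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b = rng.standard_normal(d)
H = A.T @ A                        # inner Hessian grad2_yy g

x, y = np.zeros(d), np.zeros(d)
alpha, eta, beta = 0.1, 0.5, 0.5   # inner, v-loop, and outer step sizes
N, M = 10, 10                      # loop lengths; N = M = 1 is "no-loop"

for t in range(300):
    # (1) inner loop: gradient steps on grad_y g = A^T (A y - x)
    for _ in range(N):
        y -= alpha * (A.T @ (A @ y - x))
    # (2) v-loop: gradient steps on q(v) = 0.5 v^T H v - v^T grad_y f,
    #     whose minimizer is H^{-1} grad_y f (here grad_y f = y - b)
    v = np.zeros(d)
    for _ in range(M):
        v -= eta * (H @ v - (y - b))
    # (3) hypergradient step; grad_x f = 0 and grad2_xy g = -A for this g,
    #     so grad Phi(x) is approximated by -(-A) @ v = A @ v
    x -= beta * (A @ v)

# for this instance the bilevel optimum is y* = b, x* = A b
print("||x - A b|| =", np.linalg.norm(x - A @ b))
```

Increasing N and M trades more computation per update step for a more accurate hypergradient, which is precisely the tradeoff the paper's complexity bounds quantify.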
Many thanks to the reviewers for their deep, thoughtful reviews and constructive suggestions.
We note that despite very recent observations on the empirical superiority of adaptive synchronization (e.g., …), it would surely be interesting to see whether our bound can be tightened.
R1, clarification on the log T communication rounds: for local SGD with periodic averaging, the proof techniques are more involved. We do not tune the learning rate.
Improved Quantization Strategies for Managing Heavy-tailed Gradients in Distributed Learning
Guangfeng Yan, Tan Li, Yuanzhang Xiao, Hanxu Hou, Linqi Song
Gradient compression has surfaced as a key technique for addressing the challenge of communication efficiency in distributed learning. In distributed deep learning, however, gradient distributions are observed to be heavy-tailed, with outliers significantly influencing the design of compression strategies. Existing parameter quantization methods suffer performance degradation when this heavy-tailed feature is ignored. In this paper, we introduce a novel compression scheme specifically engineered for heavy-tailed gradients, which effectively combines gradient truncation with quantization. This scheme is implemented within a communication-limited distributed Stochastic Gradient Descent (SGD) framework. Considering a general family of heavy-tailed gradients that follow a power-law distribution, we aim to minimize the error resulting from quantization, thereby determining optimal values for two critical parameters: the truncation threshold and the quantization density. We provide a theoretical analysis of the convergence error bound under both uniform and non-uniform quantization scenarios. Comparative experiments against other benchmarks demonstrate the effectiveness of our proposed method in managing heavy-tailed gradients in a distributed learning environment.
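Below is a minimal sketch of the truncate-then-quantize idea on synthetic power-law data. The helper truncate_and_quantize, the fixed threshold tau = 3.0, and the 15-level grid are hypothetical illustrative choices (the paper derives optimal values for these two parameters); the quantizer shown is the plain unbiased stochastic uniform one.

```python
import numpy as np

def truncate_and_quantize(g, tau, s, rng):
    """Clip coordinates to [-tau, tau], then apply s-level stochastic
    uniform quantization, which is unbiased for the clipped vector."""
    g_clip = np.clip(g, -tau, tau)
    u = (g_clip + tau) / (2.0 * tau)   # map to [0, 1]
    scaled = u * s                     # position on the grid {0, 1, ..., s}
    lower = np.floor(scaled)
    prob_up = scaled - lower           # round up w.p. (scaled - lower), so E[q] = scaled
    q = lower + (rng.random(g.shape) < prob_up)
    return (q / s) * 2.0 * tau - tau   # map grid points back to [-tau, tau]

rng = np.random.default_rng(0)
# synthetic heavy-tailed "gradient": power-law magnitudes with random signs
g = rng.pareto(2.5, size=10_000) * rng.choice([-1.0, 1.0], size=10_000)
g_hat = truncate_and_quantize(g, tau=3.0, s=15, rng=rng)
print("quantization MSE vs clipped gradient:",
      np.mean((g_hat - np.clip(g, -3.0, 3.0)) ** 2))
```

The stochastic rounding keeps the output unbiased for the clipped vector, so truncation bias and quantization variance can be traded off against each other through the threshold tau and the number of levels s.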